Assignment 9: GBDT

Response Coding: Example

The response table is built only on the train dataset. A category that is absent from the train data but present in the test data is encoded with default values. Ex: if the test data has State: D, we encode it as [0.5, 0.5].
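The idea can be sketched in plain Python as follows. This is a minimal illustration, not the reference solution: the function names `fit_response_table` and `transform_response`, the toy `State` values, and the labels are all made up for the example, and a binary target with classes 0 and 1 is assumed.

```python
from collections import Counter, defaultdict


def fit_response_table(categories, labels, classes=(0, 1)):
    """Build {category: [P(class | category) for each class]} from TRAIN data only."""
    counts = defaultdict(Counter)
    for cat, label in zip(categories, labels):
        counts[cat][label] += 1
    table = {}
    for cat, label_counts in counts.items():
        total = sum(label_counts.values())
        table[cat] = [label_counts[c] / total for c in classes]
    return table


def transform_response(categories, table, default=(0.5, 0.5)):
    """Encode categories; categories unseen in train get the default probabilities."""
    return [table.get(cat, list(default)) for cat in categories]


# Hypothetical toy data: one categorical feature and a binary label.
train_state = ["A", "B", "C", "A", "A", "B", "A", "A", "C", "C"]
train_label = [0, 1, 1, 0, 1, 1, 0, 1, 1, 0]

table = fit_response_table(train_state, train_label)

# "D" never appears in train, so it falls back to the default value.
test_encoded = transform_response(["A", "D"], table)
print(test_encoded)
```

Each categorical column then contributes one feature per class (two features for a binary target) instead of one column per category.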

  1. Apply GBDT on these feature sets
    • Set 1: categorical (instead of one-hot encoding, try response coding: use probability values), numerical features + project_title (TFIDF) + preprocessed_essay (TFIDF) + sentiment score of essay (check the example below, include all 4 values as 4 features)
    • Set 2: categorical (instead of one-hot encoding, try response coding: use probability values), numerical features + project_title (TFIDF W2V) + preprocessed_essay (TFIDF W2V)
    • In response encoding you need to apply the Laplace smoothing value for the test set. Laplace smoothing here means: if a category is present in the test set but not in the train set, use the default 0.5 as its probability value (refer to the Response Encoding image in the cell above)
    • Please use at least 35k data points
  2. Hyperparameter tuning (consider any two hyperparameters)
    • Find the best hyperparameters, i.e. those that give the maximum AUC value
    • Find the best hyperparameters using k-fold cross validation or simple cross validation data
    • Use GridSearchCV or RandomizedSearchCV, or write your own for loops to do this task
  3. Representation of results
    • You need to plot the performance of the model on both train data and cross validation data for each hyperparameter, as shown in the figure, with X-axis as n_estimators, Y-axis as max_depth, and Z-axis as AUC score. We have given a notebook that explains how to draw this 3D plot; you can find it in the same drive: 3d_scatter_plot.ipynb
    • or


    • You need to plot the performance of the model on both train data and cross validation data for each hyperparameter, as shown in the figure, using seaborn heat maps with rows as n_estimators, columns as max_depth, and the values inside the cells representing the AUC score
    • Choose either of the two plotting techniques: 3D plot or heat map
    • Once you have found the best hyperparameters, train your model with them, find the AUC on test data, and plot the ROC curve on both train and test. Make sure that you use the predict_proba method to calculate the AUC curves, because AUC is calculated on class probabilities and not on class labels.
    • Along with plotting the ROC curve, print the confusion matrix with predicted and original labels of the test data points

  4. Summarize the results at the end of the notebook in table format
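For steps 2 and 3, one possible shape of the tuning code is sketched below. It is only an illustration under several assumptions: scikit-learn's `GradientBoostingClassifier` stands in for xgboost/lightgbm, the data comes from `make_classification` rather than the assignment dataset, the grid values are arbitrary, and `score_grid` is a hypothetical helper.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the real feature matrix and labels.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Two hyperparameters, as the assignment asks; the values are placeholders.
param_grid = {"n_estimators": [10, 30], "max_depth": [2, 3]}

search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid,
    scoring="roc_auc",        # AUC is computed from predicted probabilities
    cv=3,
    return_train_score=True,  # needed to report train AUC as well as CV AUC
)
search.fit(X, y)


def score_grid(results, key):
    """Arrange scores as rows = n_estimators, columns = max_depth."""
    rows, cols = param_grid["n_estimators"], param_grid["max_depth"]
    grid = np.zeros((len(rows), len(cols)))
    for params, score in zip(results["params"], results[key]):
        i = rows.index(params["n_estimators"])
        j = cols.index(params["max_depth"])
        grid[i, j] = score
    return grid


train_auc = score_grid(search.cv_results_, "mean_train_score")
cv_auc = score_grid(search.cv_results_, "mean_test_score")
print(search.best_params_)
```

The two arrays `train_auc` and `cv_auc` have the layout the heat map option describes, so each can be passed to `seaborn.heatmap` with `annot=True` and the grid values as tick labels.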

Few Notes

  1. Use at least 35k data points
  2. Use the classifier.predict_proba() method instead of the predict() method while calculating roc_auc scores
  3. Be sure that you use Laplace smoothing in the response encoding function. Laplace smoothing here means applying the default (0.5) value to a test data category that is not present in the train set
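Note 2 can be illustrated with a short sketch. The data is synthetic and scikit-learn's `GradientBoostingClassifier` is used as a stand-in model; both are assumptions made for the example, and the hyperparameter values are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import confusion_matrix, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the assignment data.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

clf = GradientBoostingClassifier(n_estimators=50, max_depth=2, random_state=0)
clf.fit(X_train, y_train)

# Column 1 of predict_proba holds the probability of the positive class.
train_proba = clf.predict_proba(X_train)[:, 1]
test_proba = clf.predict_proba(X_test)[:, 1]

train_auc = roc_auc_score(y_train, train_proba)
test_auc = roc_auc_score(y_test, test_proba)

# Points of the test ROC curve, ready to be plotted as tpr against fpr.
fpr, tpr, thresholds = roc_curve(y_test, test_proba)

# The confusion matrix, in contrast, needs hard class labels from predict().
cm = confusion_matrix(y_test, clf.predict(X_test))
print(train_auc, test_auc)
print(cm)
```

Passing the output of `predict()` to `roc_auc_score` would collapse the ROC curve to a single operating point, which is why the probabilities are used for AUC and the labels only for the confusion matrix.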

1. GBDT (xgboost/lightgbm)

1.1 Loading Data

1.2 Splitting data into Train and cross validation(or test): Stratified Sampling
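A stratified split could look like the sketch below. The arrays are synthetic placeholders, and the 85/15 class ratio and 20% test size are assumptions chosen only to make the effect of `stratify` visible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder features and an imbalanced binary label (85% positive).
X = np.arange(1000).reshape(-1, 1)
y = np.array([1] * 850 + [0] * 150)

# stratify=y keeps the class ratio the same in both parts of the split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(y_train.mean(), y_test.mean())
```

Splitting before any encoding step matters here, because the response table and the TFIDF vocabulary must be fitted on the train part only.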

1.3 Make Data Model Ready: encoding numerical, categorical features

1.4 Make Data Model Ready: encoding essay, and project_title
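For the TFIDF W2V encoding in Set 2, one common reading is a TF-IDF weighted average of word vectors, sketched below. The essays are toy strings, the 3-dimensional random vectors in `word_vectors` are stand-ins for pretrained embeddings, and `tfidf_w2v` is a hypothetical helper name.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy texts standing in for the preprocessed essays.
train_essays = ["students need books", "students need laptops for coding"]
test_essays = ["books for coding club"]

# Random 3-d vectors as a stand-in for pretrained word embeddings.
rng = np.random.default_rng(0)
vocab = {w for doc in train_essays for w in doc.split()}
word_vectors = {w: rng.normal(size=3) for w in vocab}

# IDF values are learned from the train essays only.
tfidf = TfidfVectorizer()
tfidf.fit(train_essays)
idf = {word: tfidf.idf_[idx] for word, idx in tfidf.vocabulary_.items()}


def tfidf_w2v(text, dim=3):
    """Average the word vectors of a text, weighted by tf * idf."""
    words = text.split()
    vec = np.zeros(dim)
    weight_sum = 0.0
    for w in words:
        if w in word_vectors and w in idf:
            weight = idf[w] * words.count(w) / len(words)
            vec += weight * word_vectors[w]
            weight_sum += weight
    # Texts with no known words stay as the zero vector.
    return vec / weight_sum if weight_sum else vec


X_train_w2v = np.vstack([tfidf_w2v(t) for t in train_essays])
X_test_w2v = np.vstack([tfidf_w2v(t) for t in test_essays])
print(X_train_w2v.shape, X_test_w2v.shape)
```

Words that appear only in the test essays (here "club") are skipped, since they have neither an IDF value nor a vector in this sketch.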

1.5 Applying Models on different kinds of featurization as mentioned in the instructions


Apply GBDT on the different kinds of featurization as mentioned in the instructions
For every model that you work on, make sure you do step 2 and step 3 of the instructions
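Assembling one feature set and fitting a model might look like the following sketch. Everything in it is illustrative: the essays, labels, response-coded `state_response` values, and `price` column are invented, scikit-learn's `GradientBoostingClassifier` stands in for xgboost/lightgbm, and the vectorizer is fitted on the same small sample it transforms only to keep the example short.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented toy rows: text, a response-coded category, a numerical feature.
essays = [
    "students need books for reading",
    "class needs new laptops",
    "science kits for young students",
    "art supplies for the classroom",
    "books and reading corner",
    "laptops for coding club",
]
y = np.array([1, 0, 1, 1, 1, 0])
state_response = np.array(
    [[0.4, 0.6], [0.7, 0.3], [0.4, 0.6], [0.5, 0.5], [0.4, 0.6], [0.7, 0.3]]
)
price = np.array([[0.1], [0.9], [0.3], [0.2], [0.1], [0.8]])  # already scaled

# Text features are sparse; dense blocks are converted before stacking.
tfidf = TfidfVectorizer()
X_text = tfidf.fit_transform(essays)
X_features = hstack([X_text, csr_matrix(state_response), csr_matrix(price)]).tocsr()

model = GradientBoostingClassifier(n_estimators=20, max_depth=2, random_state=0)
model.fit(X_features, y)

proba = model.predict_proba(X_features)[:, 1]
print(X_features.shape, proba.shape)
```

In the notebook the same column order has to be used for the train, cross validation, and test matrices, with every encoder fitted on the train split and only applied to the others.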

3. Summary


As mentioned in step 4 of the instructions